Bioimage analysis is central to the work of biologists, biomedical researchers, and clinicians, among others. Since the start of the era of "big data," it has become increasingly easy to capture and store large quantities of image data. However, the unique domain-specific properties of images make large-scale analysis difficult.
A common operation in image analysis is to find the principal components of the images, perhaps for clustering and segmentation or for making predictions about unobserved data. This operation is typically very expensive and does not scale beyond medium-sized datasets in its "traditional" form; instead, we have to look to distributed methods to conduct the same analyses. This brief demo shows how to extract the top 10 principal components from the entire FERET image dataset.
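To see why the "traditional" form breaks down, here is a minimal single-machine sketch of PCA via the SVD, using a small synthetic array as a stand-in for real data. It only works while the entire dataset fits in one machine's memory, which is exactly the assumption that fails at scale.

import numpy as np

# Synthetic stand-in: 100 flattened images, each 384 x 256 pixels.
data = np.random.rand(100, 384 * 256)

# Center the data, then take the SVD; the leading right singular vectors
# are the principal components.
centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
top10 = Vt[:10]   # the first 10 principal components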
For completeness, we'll list here the steps we took to prepare the data and create the AWS cluster on which we'll be performing this demo.
1. Install the AWS command-line tools. The package is called awscli, and it can be installed using the operating system package manager (e.g. apt-get install awscli) or with Python's package manager (e.g. pip install awscli).
2. Create an S3 bucket on AWS.
aws s3 mb s3://mybucket --region us-east-1
3. Upload all the image data to the S3 bucket.
aws s3 cp local/path/to/images s3://mybucket/path --recursive
4. Install thunder-python.
pip install thunder-python
5. Use thunder-python's EC2 scripts to provision a compute cluster on AWS. There is great documentation available on the project webpage to help: http://thunder-project.org/thunder/docs/install_ec2.html
IMPORTANT: In the final step above (step 5), make sure you use the --copy-aws-credentials option when provisioning the cluster. This gives your cluster direct access to S3; otherwise you will need to copy your AWS access keys over manually.
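For reference, the launch command looks roughly like the following (the key pair, identity file, cluster size, and cluster name are placeholders, and the exact flags can vary between thunder versions, so double-check them against the documentation linked above):

thunder-ec2 -k mykeypair -i ~/mykeypair.pem -s 15 --copy-aws-credentials launch mycluster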
In [1]:
# IPython Notebook preliminaries.
%matplotlib inline
In [2]:
# Plotting setup, plus thunder's helper for displaying images inline.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("notebook")

from thunder import Colorize
image = Colorize.image
Now that all the setup is done, we're going to read the image data we deposited earlier on S3 into a distributed data structure. If we were using vanilla Spark, this might be an entire operation unto itself, despite Spark's relatively high level of abstraction. thunder-python, however, is built specifically to handle image and time series data, so it provides utility methods that make it very simple to read a large image dataset into a robust distributed data structure.
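To appreciate what thunder is saving us here, this is a rough sketch of what the same step might look like in plain PySpark; the bucket path is a placeholder, and we assume Pillow is installed on every worker to decode the TIFFs.

from io import BytesIO
import numpy as np
from PIL import Image

# sc is the SparkContext; binaryFiles yields (filename, raw bytes) pairs.
raw = sc.binaryFiles("s3n://mybucket/path")

def decode(record):
    name, data = record
    return name, np.array(Image.open(BytesIO(data)))

# An RDD of (filename, 2-d pixel array) pairs -- and we would still need to
# arrange these into a matrix-friendly layout ourselves.
images_rdd = raw.map(decode)

With thunder, the same thing is a single call: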
In [3]:
# tsc is the ThunderContext that thunder provides on the cluster. This pulls in all
# the .tif files at the S3 path and converts them into a Series data structure--an
# array-like entity that is distributed across the cluster.
images = tsc.loadImagesAsSeries("s3://squinn-bsec2015/feret", inputFormat = "tif")
# Keep the Series in memory so subsequent operations don't re-read it from S3.
images.cache()
Out[3]:
That's it! We've read all the images into a Series: a distributed collection of keyed records, where each key is a spatial coordinate and each value is a one-dimensional array holding that pixel's values across all of the images. Now we can ask basic questions, such as: how many records are we looking at? What does an image look like?
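Before that, to make the structure concrete, here is a quick way to peek at a single record (illustrative only; the exact key layout depends on how the images were converted):

# One record of the Series: the key is a spatial coordinate, and the value
# holds that pixel's intensity across every image in the dataset.
key, values = images.first()
print key
print values.shape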
In [4]:
# How many records are we looking at? (The Series has one record per pixel position.)
print images.nrecords
In [5]:
# Let's take a look at the first image. first() returns a (key, value) record,
# so [1] grabs the image data itself.
i = images.toImages().first()[1]
image(i)
In [6]:
# Maybe another image, too?
i = images.toImages().get(2)
image(i)
We have our image dataset of faces. Let's now compute the first 10 principal components using a distributed SVD solver. Conceptually, we "stack" the data into a matrix with one row per Series record: there are 98,304 records, one per pixel of the 384 by 256 grid, and each row holds that pixel's values across all of the images. That matrix is then decomposed. In practice, the data never leave whatever machine they're on--rather, the code is sent over the network and each record is annotated with its effective "row" in the matrix, which is used later to perform the actual SVD. All these details are hidden from the user.
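As a purely hypothetical illustration of that bookkeeping (not thunder's actual implementation), a record's spatial key can be turned into a linear row index, so each worker knows where its records sit in the implicit matrix without any pixel data being moved:

# Hypothetical helper: map a record's (x, y) key to its row in the implicit matrix.
nrows, ncols = 384, 256

def row_index(key):
    x, y = key
    return x * ncols + y

print row_index((0, 0))      # first row
print row_index((383, 255))  # last of the 98,304 rows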
In [7]:
# Import the PCA method. (http://thunder-project.org/thunder/docs/tutorials/factorization.html#pca)
from thunder import PCA
# Initialize our PCA model with k = 10 components, and then fit that model to our dataset.
model = PCA(k = 10).fit(images)
That took about an hour to run on all the images with 15 AWS instances. Here we'll plot the top 10 principal components.
In [8]:
# Pull the per-pixel scores back to the driver, packed into image-shaped arrays.
components = model.scores.pack()
components.shape
Out[8]:
As we see from the shape of the resulting components, we have 10 components, each 384 by 256 pixels. We'll plot each of those now.
In [9]:
# Plot the 10 components in a 2 x 5 grid (subplot indices start at 1).
for img in range(components.shape[0]):
    plt.subplot(2, 5, img + 1)
    image(components[img])